Iterative Universal Hash Function Generator for Minhashing

نویسنده

  • Fabrício Olivetti de França
چکیده

Minhashing is a technique used to estimate the Jaccard Index between two sets by exploiting the probability of collision in a random permutation. In order to speed up the computation, a random permutation can be approximated by using an universal hash function such as the ha,b function proposed by Carter and Wegman. A better estimate of the Jaccard Index can be achieved by using many of these hash functions, created at random. In this paper a new iterative procedure to generate a set of ha,b functions is devised that eliminates the need for a list of random values and avoid the multiplication operation during the calculation. The properties of the generated hash functions remains that of an universal hash function family. This is possible due to the random nature of features occurrence on sparse datasets. Results show that the uniformity of hashing the features is maintaned while obtaining a speed up of up to 1.38 compared to the traditional approach.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Simple Pseudorandom Number Generator with Strengthened Double Encryption (Cilia)

A new cryptographic pseudorandom number generator Cilia is presented. It hashes real random data using an iterative hash function to update its secret state, and it generates pseudorandom numbers using a block cipher. Cilia is a simple algorithm that uses an improved variant of double encryption with additional security to generate pseudorandom numbers, and its performance is similar to double ...

متن کامل

Quantum Hashing via Classical $\epsilon$-universal Hashing Constructions

We define the concept of a quantum hash generator and offer a design, which allows one to build a large number of different quantum hash functions. The construction is based on composition of a classical ǫ-universal hash family and a given family of functions – quantum hash generators. The relationship between ǫ-universal hash families and error-correcting codes give possibilities to build a la...

متن کامل

A Cookbook for Black-Box Separations and a Recipe for UOWHFs

We present a new framework for proving fully black-box separations and lower bounds. We prove a general theorem that facilitates the proofs of fully black-box lower bounds from a one-way function (OWF). Loosely speaking, our theorem says that in order to prove that a fully black-box construction does not securely construct a cryptographic primitive Q (e.g., a pseudo-random generator or a univer...

متن کامل

An Improved Hash Function Based on the Tillich-Zémor Hash Function

Using the idea behind the Tillich-Zémor hash function, we propose a new hash function. Our hash function is parallelizable and its collision resistance is implied by a hardness assumption on a mathematical problem. Also, it is secure against the known attacks. It is the most secure variant of the Tillich-Zémor hash function until now.

متن کامل

Leftover Hash Lemma, Revisited

The famous Leftover Hash Lemma (LHL) states that (almost) universal hash functions are good randomness extractors. Despite its numerous applications, LHL-based extractors suffer from the following two limitations: – Large Entropy Loss: to extract v bits from distribution X of minentropy m which are ε-close to uniform, one must set v ≤ m − 2 log (1/ε), meaning that the entropy loss L def = m − v...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • CoRR

دوره abs/1401.6124  شماره 

صفحات  -

تاریخ انتشار 2014